
    Management of heterogeneous multi-cluster networks with the Madeleine III library

    This paper introduces the new version of the Madeleine portable multi-protocol communication library. Madeleine version III now includes full, flexible multi-cluster support associated with a redesigned version of the transparent multi-network message-forwarding mechanism. Madeleine III works together with a new configuration management module to handle a wide range of network-heterogeneous multi-cluster configurations. The integration of a new topology information system allows programmers of parallel computing applications to build highly optimized distributed algorithms on top of the transparent multi-network communication system provided by Madeleine III's virtual networks. The preliminary experiments we conducted on the new virtual network capabilities of Madeleine III showed interesting results, with an asymptotic bandwidth of 43 MB/s over a virtual link made of a SISCI/SCI and a BIP/Myrinet physical link.
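
    A minimal sketch of the transparent multi-network forwarding idea described above: a gateway node re-emits each message received on one physical network onto the other, so both clusters see a single virtual link. The functions below are stubs chosen for illustration; this is not the Madeleine III API.

```c
/* Illustrative sketch only: a gateway process that makes two physical
 * networks look like one virtual link by re-emitting every message received
 * on one side onto the other.  The recv/send functions are stubs standing in
 * for hypothetical drivers; this is not the Madeleine III API. */
#include <stdio.h>
#include <string.h>

/* Stub "drivers": in a real gateway these would be the SISCI/SCI and
 * BIP/Myrinet transports. */
static size_t sci_recv(char *buf, size_t max)
{
    const char *msg = "payload from the SCI cluster";
    size_t len = strlen(msg) + 1;
    if (len > max) len = max;
    memcpy(buf, msg, len);
    return len;
}

static void myrinet_send(const char *buf, size_t len)
{
    printf("forwarding %zu bytes to the Myrinet cluster: %s\n", len, buf);
}

/* One hop of transparent forwarding: the application on either cluster never
 * sees the gateway, only the virtual link. */
static void forward_once(void)
{
    char buf[64 * 1024];
    size_t len = sci_recv(buf, sizeof buf);
    myrinet_send(buf, len);
}

int main(void)
{
    forward_once();
    return 0;
}
```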

    NetIbis: An Efficient and Dynamic Communication System for Heterogeneous Grids

    Grids are more heterogeneous and dynamic than traditional parallel or distributed systems, both in terms of processors and of interconnects. A grid communication system must handle many issues: first, it must run on networks that are not yet determined when the application is launched, including user-space interconnects; second, it must transparently run on different networks at the same time; third, it should yield performance close to that of specialized communication systems. In this paper, we present NetIbis, a new Java communication system that provides a uniform interface for any underlying intercluster or intracluster network. NetIbis solves the heterogeneity issues posed by grid computing by dynamically constructing network protocol stacks out of drivers: self-contained building blocks, each with limited functionality, that allow flexible configuration. We describe the design and implementation of the major NetIbis drivers for serialization, multicast, reliability, and various underlying networks. We also describe various performance optimizations, such as layer collapsing for the GM driver. We evaluate the performance of NetIbis on several platforms, including a European grid.
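
    A conceptual sketch, in C rather than NetIbis's Java, of the driver-stacking idea: small, self-contained building blocks composed into a protocol stack at run time, each layer wrapping the one below it. The driver names and interfaces are illustrative only, not the NetIbis API.

```c
/* Conceptual sketch of composing a protocol stack from small drivers at run
 * time.  Each driver wraps the one below it; names are illustrative only. */
#include <stdio.h>
#include <string.h>

typedef struct driver {
    const char    *name;
    struct driver *below;                         /* next layer down */
    void (*send)(struct driver *self, const void *buf, size_t len);
} driver_t;

static void tcp_send(driver_t *self, const void *buf, size_t len)
{
    (void)self; (void)buf;
    printf("tcp: sending %zu bytes\n", len);      /* bottom of the stack */
}

static void frag_send(driver_t *self, const void *buf, size_t len)
{
    const size_t mtu = 1024;                      /* fragment large messages */
    const char *p = buf;
    while (len > 0) {
        size_t chunk = len < mtu ? len : mtu;
        self->below->send(self->below, p, chunk);
        p += chunk;
        len -= chunk;
    }
}

int main(void)
{
    /* Stack chosen dynamically: fragmentation on top of TCP. */
    driver_t tcp  = { "tcp",  NULL, tcp_send  };
    driver_t frag = { "frag", &tcp, frag_send };

    char msg[3000];
    memset(msg, 'x', sizeof msg);
    frag.send(&frag, msg, sizeof msg);            /* goes through both layers */
    return 0;
}
```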

    Short Paper: Dynamic Optimization of Communications over High Speed Networks

    We present a new communication subsystem for high-speed networks featuring an extensible packet optimization engine that mixes several communication flows. Optimizations are parameterized by the capabilities of the underlying network drivers, and are triggered by the network cards when they become idle. The database of predefined strategies can easily be extended.
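
    A minimal sketch of the mechanism described above, assuming a hypothetical send queue and driver hook: when the card becomes idle, a predefined strategy coalesces pending packets from several flows into a single physical transfer. All types and names are assumptions for illustration, not the library's interfaces.

```c
/* Illustrative sketch: when the network card becomes idle, an optimization
 * strategy coalesces small pending packets from several logical flows into a
 * single physical send.  All types and names here are hypothetical. */
#include <stdio.h>
#include <string.h>

#define MAX_PENDING  64
#define AGG_BUF_SIZE (64 * 1024)

typedef struct {
    const void *data;
    size_t      len;
    int         flow_id;                /* logical communication flow */
} packet_t;

typedef struct {
    packet_t pending[MAX_PENDING];
    size_t   count;
} send_queue_t;

/* Stub standing in for the real driver send. */
static void nic_send(const void *buf, size_t len)
{
    (void)buf;
    printf("one physical send of %zu bytes\n", len);
}

/* Strategy: aggregate everything that fits into one buffer, whatever the
 * flow, so that one large transfer replaces many small ones. */
static void strategy_aggregate(send_queue_t *q)
{
    static char agg[AGG_BUF_SIZE];
    size_t used = 0, i;

    for (i = 0; i < q->count; i++) {
        if (used + q->pending[i].len > sizeof agg)
            break;                                       /* buffer full */
        memcpy(agg + used, q->pending[i].data, q->pending[i].len);
        used += q->pending[i].len;
    }
    if (used > 0)
        nic_send(agg, used);

    /* Keep whatever did not fit for the next time the card is idle. */
    memmove(q->pending, q->pending + i, (q->count - i) * sizeof(packet_t));
    q->count -= i;
}

/* Hook called by the driver when the network card becomes idle. */
static void on_nic_idle(send_queue_t *q)
{
    if (q->count > 0)
        strategy_aggregate(q);
}

int main(void)
{
    send_queue_t q = { .count = 2 };
    q.pending[0] = (packet_t){ "hello", 5, /*flow*/ 0 };
    q.pending[1] = (packet_t){ "world", 5, /*flow*/ 1 };
    on_nic_idle(&q);                     /* emits one 10-byte physical send */
    return 0;
}
```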

    NewMadeleine: scheduling and optimization of high-performance communication schemes

    Despite the spectacular progress made by communication interfaces for high-speed networks over the last fifteen years, many potential optimizations still escape communication libraries. This is mainly due to a design focused on shortening the critical path as much as possible in order to minimize latency. In this article, we present a new communication library architecture built around a powerful transfer optimization engine whose activity is synchronized with that of the network cards. The code of the optimization strategies is generic and portable, and it is parameterized at run time by the capabilities of the underlying network drivers. The database of predefined optimization strategies is easily extensible. The scheduler is furthermore able to mix multiple logical flows globally over one or several physical cards, potentially of different technologies, in a heterogeneous multi-rail fashion.

    Task-Based Performance Portability in HPC: Maximising long-term investments in a fast evolving, complex and heterogeneous HPC landscape

    As HPC hardware continues to evolve and diversify and workloads become more dynamic and complex, applications need to be expressed in a way that facilitates high performance across a range of hardware and situations. The main application code should be platform-independent, malleable and asynchronous, with an open, clean, stable and dependable interface between the higher levels of the application, library or programming model and the kernels and software layers tuned for the machine. The platform-independent part should avoid direct references to specific resources and their availability, and instead provide the information needed to optimise behaviour. This paper summarises how task abstraction, which first appeared in the 1990s and is already mainstream in HPC, should be the basis for a composable and dynamic performance-portable interface. It outlines the innovations that are required in the programming model and runtime layers, and highlights the need for a greater degree of trust among application developers in the ability of the underlying software layers to extract full performance. These steps will help realise the vision for performance portability across current and future architectures and problems.

    Beyond Gbps Turbo Decoder on Multi-Core CPUs

    This paper presents a high-throughput implementation of a portable software turbo decoder. The code is optimized for traditional multi-core CPUs (such as x86) and is based on the Enhanced max-log-MAP turbo decoding variant. The code follows the LTE-Advanced specification. The key to the high performance is an inter-frame SIMD strategy combined with a fixed-point representation. Our results show that the proposed multi-core CPU implementation of turbo decoders is a challenging alternative to GPU implementations in terms of throughput and energy efficiency. On a high-end processor, our software turbo decoder exceeds 1 Gbps of information throughput for all rate-1/3 LTE codes with K < 4096.
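
    A minimal sketch of the inter-frame SIMD layout, assuming SSE2 intrinsics and 16-bit fixed-point metrics: eight frames are interleaved so that one saturated add followed by a max performs the add-compare-select step for all eight frames at once. This illustrates the data layout only; it is not the authors' decoder.

```c
/* Sketch of inter-frame SIMD for max-log-MAP style metric updates: 8 frames
 * are interleaved so that lane i of every vector belongs to frame i.  One
 * saturated add + max then performs the add-compare-select step for all 8
 * frames simultaneously.  Illustrative only. */
#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdio.h>

#define NFRAMES 8        /* one 16-bit lane per frame */

/* Metrics are laid out frame-interleaved: element f belongs to frame f. */
static void acs_step(const int16_t branch0[NFRAMES],
                     const int16_t branch1[NFRAMES],
                     const int16_t state0[NFRAMES],
                     const int16_t state1[NFRAMES],
                     int16_t out[NFRAMES])
{
    __m128i b0 = _mm_loadu_si128((const __m128i *)branch0);
    __m128i b1 = _mm_loadu_si128((const __m128i *)branch1);
    __m128i s0 = _mm_loadu_si128((const __m128i *)state0);
    __m128i s1 = _mm_loadu_si128((const __m128i *)state1);

    /* add (saturated, fixed point) then compare-select, for 8 frames at once */
    __m128i p0 = _mm_adds_epi16(s0, b0);
    __m128i p1 = _mm_adds_epi16(s1, b1);
    __m128i m  = _mm_max_epi16(p0, p1);

    _mm_storeu_si128((__m128i *)out, m);
}

int main(void)
{
    int16_t b0[NFRAMES] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    int16_t b1[NFRAMES] = { 8, 7, 6, 5, 4, 3, 2, 1 };
    int16_t s0[NFRAMES] = { 10, 10, 10, 10, 10, 10, 10, 10 };
    int16_t s1[NFRAMES] = { 0, 0, 0, 0, 20, 20, 20, 20 };
    int16_t out[NFRAMES];

    acs_step(b0, b1, s0, s1, out);
    for (int f = 0; f < NFRAMES; f++)
        printf("frame %d: %d\n", f, out[f]);   /* selected path metric per frame */
    return 0;
}
```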

    Bridging the gap between OpenMP and task-based runtime systems for the fast multipole method

    With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability, while preserving programming tractability on such hardware, prompted the HPC community to design new, higher level paradigms while relying on runtime systems to maintain performance. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the task-based constructs introduced in its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library, ScalFMM, implementing the fast multipole method (FMM), which we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors, and that extensions to the 4.0 standard allow for strongly improving the performance, bridging the gap with the very high performance that was so far reserved to expert-only runtime system APIs.
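
    A minimal, self-contained example of the OpenMP 4.0 dependent-task constructs the paper relies on (unrelated to ScalFMM itself): depend clauses let the runtime order tasks from their declared data accesses instead of global barriers.

```c
/* Minimal OpenMP 4.0 dependent-task example (illustrative): the runtime
 * derives the ordering from the depend clauses.  Compile with -fopenmp. */
#include <stdio.h>

int main(void)
{
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                         /* produces a */

        #pragma omp task depend(out: b)
        b = 2;                         /* produces b, may run alongside the task above */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     /* waits for both producers */

        #pragma omp taskwait
        printf("c = %d\n", c);         /* prints 3 */
    }
    return 0;
}
```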

    Bridging the performance gap between OpenMP 4.0 and runtime systems for the fast multipole method

    With the advent of complex modern architectures, the low-level paradigms long considered sufficient to build High Performance Computing (HPC) numerical codes have met their limits. Achieving efficiency and ensuring portability, while preserving programming tractability on such hardware, prompted the HPC community to design new, higher level paradigms. The successful ports of fully-featured numerical libraries on several recent runtime system proposals have indeed shown the benefit of task-based parallelism models in terms of performance portability on complex platforms. However, the common weakness of these projects is to deeply tie applications to specific expert-only runtime system APIs. The OpenMP specification, which aims at providing a common parallel programming means for shared-memory platforms, appears as a good candidate to address this issue thanks to the latest task-based constructs introduced as part of its revision 4.0. The goal of this paper is to assess the effectiveness and limits of this support for designing a high-performance numerical library. We illustrate our discussion with the ScalFMM library, which implements state-of-the-art fast multipole method (FMM) algorithms and which we have deeply redesigned with respect to the most advanced features provided by OpenMP 4. We show that OpenMP 4 allows for significant performance improvements over previous OpenMP revisions on recent multicore processors. We furthermore propose extensions to the OpenMP 4 standard and show how they can enhance FMM performance. To assess our statement, we have implemented this support within the Klang-omp source-to-source compiler, which translates OpenMP directives into calls to the StarPU task-based runtime system. This study shows that we can take advantage of the advanced capabilities of a fully-featured runtime system without resorting to a specific, native runtime port, hence bridging the gap between the OpenMP standard and the very high performance that was so far reserved to expert-only runtime system APIs.

    Decentralized in-order execution of a sequential task-based code for shared-memory architectures

    The hardware complexity of modern machines makes the design of adequate programming models crucial for jointly ensuring performance, portability, and productivity in high-performance computing (HPC). Sequential task-based programming models paired with advanced runtime systems allow the programmer to write a sequential algorithm independently of the hardware architecture in a productive and portable manner, and let a third-party software layer, the runtime system, deal with the burden of scheduling a correct, parallel execution of that algorithm to ensure performance. Many HPC algorithms have successfully been implemented following this paradigm, as a testimony to its effectiveness. Developing algorithms that specifically require fine-grained tasks along this model is still considered prohibitive, however, due to per-task management overhead [1], forcing the programmer to resort to a less abstract, and hence more complex, "task+X" model. We thus investigate the possibility of offering a tailored execution model, trading dynamic mapping for efficiency by using a decentralized, conservative in-order execution of the task flow, while preserving the benefits of relying on the sequential task-based programming model. We propose a formal specification of the execution model as well as a prototype implementation, which we assess on a shared-memory multicore architecture with several synthetic workloads. The results show that, under the condition of a proper task mapping supplied by the programmer, the pressure on the runtime system is significantly reduced and the execution of fine-grained task flows is much more efficient.
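
    A minimal sketch of a decentralized, conservative in-order execution of a task flow, assuming the programmer supplies a static task-to-worker mapping: tasks are stored in submission order, each worker walks that order and runs its own tasks, spinning on the completion flags of declared predecessors rather than consulting a central scheduler. The data structures are illustrative, not the paper's implementation.

```c
/* Illustrative sketch: tasks are recorded in sequential submission order with
 * a static owner (worker) and an explicit predecessor index.  Each worker
 * walks the array in order, executes the tasks mapped to it, and spins on the
 * "done" flag of the predecessor -- no central scheduler is involved.
 * Compile with -pthread. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NTASKS   4
#define NWORKERS 2

typedef struct {
    int        owner;            /* worker this task is mapped to */
    int        pred;             /* index of predecessor task, -1 if none */
    void     (*run)(int id);
    atomic_int done;
} task_t;

static void work(int id) { printf("task %d executed\n", id); }

static task_t tasks[NTASKS] = {
    { 0, -1, work, 0 },          /* task 0 on worker 0 */
    { 1, -1, work, 0 },          /* task 1 on worker 1, independent of task 0 */
    { 0,  1, work, 0 },          /* task 2 on worker 0, after task 1 */
    { 1,  2, work, 0 },          /* task 3 on worker 1, after task 2 */
};

static void *worker(void *arg)
{
    int me = (int)(long)arg;
    for (int i = 0; i < NTASKS; i++) {
        if (tasks[i].owner != me)
            continue;                          /* not mine: skip, stay in order */
        if (tasks[i].pred >= 0)
            while (!atomic_load(&tasks[tasks[i].pred].done))
                ;                              /* conservative wait on predecessor */
        tasks[i].run(i);
        atomic_store(&tasks[i].done, 1);
    }
    return NULL;
}

int main(void)
{
    pthread_t th[NWORKERS];
    for (long w = 0; w < NWORKERS; w++)
        pthread_create(&th[w], NULL, worker, (void *)w);
    for (int w = 0; w < NWORKERS; w++)
        pthread_join(th[w], NULL);
    return 0;
}
```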

    Evaluation of OpenMP Dependent Tasks with the KASTORS Benchmark Suite

    The recent introduction of task dependencies in the OpenMP specification provides new ways of synchronizing tasks. Application programmers can now describe the data a task will read as input and write as output, letting the runtime system resolve fine-grain dependencies between tasks to decide which task should execute next. Such an approach should scale better than the excessive global synchronization found in most OpenMP 3.0 applications. As promising as it looks, however, any new feature needs proper evaluation to encourage application programmers to embrace it. This paper introduces the KASTORS benchmark suite designed to evaluate OpenMP task dependencies. We modified state-of-the-art OpenMP 3.0 benchmarks and data-flow parallel linear algebra kernels to make use of task dependencies. Learning from this experience, we propose extensions to the current OpenMP specification to improve the expressiveness of dependencies. We finally evaluate both the GCC/libGOMP and the CLANG/libIOMP implementations of OpenMP 4.0 on our KASTORS suite, demonstrating the interest of task dependencies compared to taskwait-based approaches.
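
    A small illustration of the contrast the suite measures (not a KASTORS kernel): the same two independent chains written with an OpenMP 3.0 global taskwait and with OpenMP 4.0 depend clauses, where each chain only waits for its own producer.

```c
/* Illustrative contrast: two independent chains (on x and on y) written with
 * an OpenMP 3.0 global taskwait and with OpenMP 4.0 depend clauses.  With
 * depend, the x chain and the y chain never block on each other.
 * Compile with -fopenmp. */
#include <stdio.h>

int main(void)
{
    int x = 0, y = 0;

    #pragma omp parallel
    #pragma omp single
    {
        /* OpenMP 3.0 style: a global taskwait separates the two phases,
         * so both chains stall at the same barrier. */
        #pragma omp task
        x = 1;
        #pragma omp task
        y = 1;
        #pragma omp taskwait
        #pragma omp task
        x += 10;
        #pragma omp task
        y += 10;
        #pragma omp taskwait
        printf("taskwait version: x=%d y=%d\n", x, y);

        /* OpenMP 4.0 style: depend clauses order each chain independently,
         * no global barrier between the phases. */
        x = 0; y = 0;
        #pragma omp task depend(out: x)
        x = 1;
        #pragma omp task depend(out: y)
        y = 1;
        #pragma omp task depend(inout: x)   /* waits only for the producer of x */
        x += 10;
        #pragma omp task depend(inout: y)   /* waits only for the producer of y */
        y += 10;
        #pragma omp taskwait
        printf("depend version:   x=%d y=%d\n", x, y);
    }
    return 0;
}
```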